This data contains 1599 instances of red wine of the Portuguese “Vinho Verde” variety. For each instance, 11 variables describe the chemical properties of the wine, and 1 rating gives its quality. The quality rating is based on the average of 3 wine expert ratings, from 0 (very bad) to 10 (very excellent).
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Some observations of these box plots: 1 - The variables with relatively high spread include citric acid, alcohol, and quality. 2 - The variables with relatively low spread include residual sugar, chlorides, and sulphates. 3 - The variables with approximately normal distributions include citric acid, density, pH, alcohol, and quality.
These boxplots are not detailed enough to depict the distribution of each feature, so let’s examine the dependent variable, quality, more closely with a histogram.
This plot shows a histogram of quality. It’s interesting that there were no values less than 3, or greater than 8. It also seems like most of the values for quality were either 5 or 6.
## [1] 0.8248906
About 82.5% of all observations had quality = 5 or 6. This confirms my observation from the histogram.
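The proportion printed above (0.8248906) can be reproduced from the per-quality counts listed in the grouped summary later in this analysis (its `n` column). This is a minimal base R sketch that rebuilds the quality column from those counts instead of loading the original file; the names `counts`, `quality`, and `share_5_or_6` are only for illustration.

```r
# Number of wines per quality score, copied from the n column of the
# grouped summary in this report.
counts <- c(`3` = 10, `4` = 53, `5` = 681, `6` = 638, `7` = 199, `8` = 18)

# Rebuild the quality column: each score repeated as often as it occurs.
quality <- rep(as.integer(names(counts)), times = counts)

# The mean of a logical vector is the proportion of TRUE values,
# so this is the share of wines rated 5 or 6.
share_5_or_6 <- mean(quality %in% c(5, 6))
print(share_5_or_6)  # 0.8248906
```

That is 1319 of the 1599 wines.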
My initial theory is alcohol content will have an effect on quality, so I’m going to break down the alcohol variable by quartile, and store this data in a new variable called alcohol.quartile with values of ‘low’, ‘mid-low’, ‘mid-high’, and ‘high’ depending on the quartile the alcohol content falls into.
##
## high low mid-high mid-low
## 407 297 396 499
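One possible way to build such a quartile variable in base R is `cut()`. The sketch below uses only the first ten alcohol values shown in the `str()` output and the quartile boundaries from the `summary()` output. How values that sit exactly on a boundary are assigned is an assumption here, as is the name `alcohol_quartile`, so the counts on the full data may not match the table above exactly.

```r
# First ten alcohol values, copied from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Quartile boundaries of the full dataset, copied from the summary() output
# (Min, 1st Qu., Median, 3rd Qu., Max). With the full data loaded,
# quantile(alcohol) would return the same five numbers.
breaks <- c(8.4, 9.5, 10.2, 11.1, 14.9)

# ASSUMPTION: a value equal to a boundary goes into the higher bucket
# (right = FALSE); include.lowest = TRUE keeps the maximum in "high".
# ordered_result = TRUE stores the levels in low-to-high order.
alcohol_quartile <- cut(alcohol, breaks = breaks,
                        labels = c("low", "mid-low", "mid-high", "high"),
                        right = FALSE, include.lowest = TRUE,
                        ordered_result = TRUE)

print(table(alcohol_quartile))
```

Because the result is an ordered factor, `table()` and plots list the buckets from low to high instead of alphabetically as in the output above.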
Now I will break down the volatile acidity into halves (above median vs. below median), label each row with ‘low’ or ‘high’, and store this data in a new variable vol.acid.half. I will also create a factor variable for the volatile acidity half to order it correctly for plotting.
##
## high low
## 816 783
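A median split like this can be sketched with `ifelse()` and `factor()`. The example uses the first ten volatile acidity values from the `str()` output and the full-dataset median (0.52) from the `summary()` output. Labelling values equal to the median as ‘high’ is an assumption, and `vol_acid_half` is an illustrative name.

```r
# First ten volatile acidity values, copied from the str() output above.
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)

# Median of the full dataset, copied from the summary() output.
full_median <- 0.52

# ASSUMPTION: values equal to the median are labelled "high".
# Setting the levels explicitly puts "low" before "high" when plotting.
vol_acid_half <- factor(
  ifelse(volatile_acidity >= full_median, "high", "low"),
  levels = c("low", "high")
)

print(table(vol_acid_half))
```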
Now I will create a factor variable for alcohol quartile to order it correctly for plotting.
It looks like the buckets are close to evenly distributed.
Let’s break the sulphates into buckets as well by rounding each value down to the nearest tenth and storing the result in a variable called sulphates_bucket.
##
## 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 1.5 1.6 1.9 2
## 9 142 503 446 251 138 51 22 18 5 6 2 2 3 1
There are few wines with sulphates >= 1, so let’s group them all into one bucket. Also, let’s group the .3’s in with the .4’s because there are only 9 wines with a value of .3.
##
## 0.4 0.5 0.6 0.7 0.8 0.9 1
## 151 503 446 251 138 51 59
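The two bucketing steps (round down to the nearest tenth, then fold the sparse ends into 0.4 and 1) could be written roughly as follows. The input is the first ten sulphates values from the `str()` output plus the dataset minimum (0.33) and maximum (2.0) from the `summary()` output, so both merged buckets are exercised. The small offset before `floor()` is a precaution added for this sketch, not something taken from the original analysis.

```r
# First ten sulphates values from the str() output, plus the dataset
# minimum and maximum from the summary() output.
sulphates <- c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8,
               0.33, 2.0)

# Round down to whole tenths. Working in tenths avoids comparing decimal
# fractions directly; the tiny offset guards against a product such as
# 5.9999999 for a value that should count as exactly 6 tenths.
tenths <- floor(sulphates * 10 + 1e-9)

# Fold everything below 0.4 into the 0.4 bucket and everything
# above 1 into the 1 bucket.
tenths <- pmin(pmax(tenths, 4), 10)

sulphates_bucket <- tenths / 10
print(table(sulphates_bucket))
```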
This is the number of occurrences for each sulphates bucket. It looks like cleaner data to work with!
Let’s create a histogram to confirm.
The sulphates buckets look approximately normally distributed.
The dataset includes 1599 observations of 13 variables. Eleven of the variables are numeric; the other two, X and quality, are integers.
The main feature of interest in my dataset is the dependent variable quality.
As of now, any of the 11 independent variables can help support my investigation into my feature of interest. I will have to explore them more in the bivariate section.
Yes, I created the alcohol.quartile variable based on the quartile that the alcohol variable falls into for each observation.
I also created wine$alc.qua.fac to organize alcohol.quartile in order.
I also created a new variable to classify volatile acidity into buckets.
I also created a new variable to classify sulphates into buckets.
Both residual sugar and chlorides had very long tails skewed to the right.
Sulphates also had a relatively long tail skewed right, so when I grouped sulphates into buckets with a new variable, I grouped the outliers into buckets that will make the data easier to visualize going forward.
Let’s create a box plot of alcohol vs. quality
Based on these box plots, it looks like alcohol is positively correlated with quality. Let’s investigate further with a scatterplot.
This scatter-plot shows a positive correlation between alcohol content and quality, affirming my suspicion.
Let’s see how the rest of the variables correlate with quality.
Based on these plots, it looks like alcohol, sulphates, and citric acid have relatively large positive correlation with quality. Also, it looks like volatile acidity, chlorides, and total sulfur dioxide have relatively large negative correlation with quality.
Let’s calculate the Pearson correlation coefficient between all numeric variables
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.256130895 0.67170343
## volatile.acidity -0.25613089 1.000000000 -0.55249568
## citric.acid 0.67170343 -0.552495685 1.00000000
## residual.sugar 0.11477672 0.001917882 0.14357716
## chlorides 0.09370519 0.061297772 0.20382291
## free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813
## total.sulfur.dioxide -0.11318144 0.076470005 0.03553302
## density 0.66804729 0.022026232 0.36494718
## pH -0.68297819 0.234937294 -0.54190414
## sulphates 0.18300566 -0.260986685 0.31277004
## alcohol -0.06166827 -0.202288027 0.10990325
## quality 0.12405165 -0.390557780 0.22637251
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.114776724 0.093705186 -0.153794193
## volatile.acidity 0.001917882 0.061297772 -0.010503827
## citric.acid 0.143577162 0.203822914 -0.060978129
## residual.sugar 1.000000000 0.055609535 0.187048995
## chlorides 0.055609535 1.000000000 0.005562147
## free.sulfur.dioxide 0.187048995 0.005562147 1.000000000
## total.sulfur.dioxide 0.203027882 0.047400468 0.667666450
## density 0.355283371 0.200632327 -0.021945831
## pH -0.085652422 -0.265026131 0.070377499
## sulphates 0.005527121 0.371260481 0.051657572
## alcohol 0.042075437 -0.221140545 -0.069408354
## quality 0.013731637 -0.128906560 -0.050656057
## total.sulfur.dioxide density pH
## fixed.acidity -0.11318144 0.66804729 -0.68297819
## volatile.acidity 0.07647000 0.02202623 0.23493729
## citric.acid 0.03553302 0.36494718 -0.54190414
## residual.sugar 0.20302788 0.35528337 -0.08565242
## chlorides 0.04740047 0.20063233 -0.26502613
## free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
## total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
## density 0.07126948 1.00000000 -0.34169933
## pH -0.06649456 -0.34169933 1.00000000
## sulphates 0.04294684 0.14850641 -0.19664760
## alcohol -0.20565394 -0.49617977 0.20563251
## quality -0.18510029 -0.17491923 -0.05773139
## sulphates alcohol quality
## fixed.acidity 0.183005664 -0.06166827 0.12405165
## volatile.acidity -0.260986685 -0.20228803 -0.39055778
## citric.acid 0.312770044 0.10990325 0.22637251
## residual.sugar 0.005527121 0.04207544 0.01373164
## chlorides 0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide 0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide 0.042946836 -0.20565394 -0.18510029
## density 0.148506412 -0.49617977 -0.17491923
## pH -0.196647602 0.20563251 -0.05773139
## sulphates 1.000000000 0.09359475 0.25139708
## alcohol 0.093594750 1.00000000 0.47616632
## quality 0.251397079 0.47616632 1.00000000
Based on these correlation coefficients, it looks like alcohol has the highest correlation with quality with a Pearson corr coef = .476. The next highest is sulphates with R = .251 and then citric acid with R=.226.
The most negative correlation coefficients with quality are volatile acidity with R = -.391 and total sulfur dioxide with R = -.185, followed closely by density with R = -.175.
This confirms what I saw in the earlier graphs.
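A correlation matrix like the one above comes from `cor()`, and the ranking against quality can be read off its `quality` column. This is a small sketch on the first ten rows from the `str()` output with only four of the columns, so the printed numbers will not match the full-data coefficients quoted here; `cor_matrix`, `with_quality`, and `ranked` are illustrative names.

```r
# First ten rows of four columns, copied from the str() output above.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Pearson correlation between every pair of columns.
cor_matrix <- cor(wine, method = "pearson")

# Correlation of each other variable with quality, ordered by absolute
# value so strong negative correlations rank alongside strong positive ones.
with_quality <- cor_matrix[colnames(cor_matrix) != "quality", "quality"]
ranked <- with_quality[order(abs(with_quality), decreasing = TRUE)]
print(round(ranked, 3))
```

Sorting by absolute value keeps a variable such as volatile acidity from being buried at the bottom of the list just because its sign is negative.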
The strongest positive relationship is fixed acidity to citric acid with R = .672, and the most negative relationship (and the strongest overall by absolute value) is fixed acidity to pH with R = -.683. These variables would logically have strong relationships because they all describe acidity.
The highest R value for a relationship that stuck out to me was density to fixed acidity with R = .668. Let’s explore that more.
High correlation in this graph!
Let’s look at alcohol vs. volatile acidity because it looks like these two variables have the strongest correlations to quality
There’s a slight negative correlation, which makes sense because they had a correlation coefficient = -.202
Now I want to explore volatile acidity’s correlation with quality, because I noted that it has a strong negative correlation.
This shows as the quality increases, the volatile acidity decreases.
Now I want to explore sulphates’s correlation with quality, because I noted that it has a relatively high positive correlation.
The slight positive correlation appears in this chart.
Let’s group the quality into groups and analyze the data that way.
## # A tibble: 6 × 4
## quality alc_mean alc_median n
## <int> <dbl> <dbl> <int>
## 1 3 9.955000 9.925 10
## 2 4 10.265094 10.000 53
## 3 5 9.899706 9.700 681
## 4 6 10.629519 10.500 638
## 5 7 11.465913 11.500 199
## 6 8 12.094444 12.150 18
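The table above has the shape of a dplyr `group_by()`/`summarise()` result. The same kind of summary can be sketched in base R with `split()`. The example uses the first ten rows from the `str()` output, so it covers only quality scores 5, 6, and 7 and its numbers differ from the full table.

```r
# First ten rows of alcohol and quality, copied from the str() output above.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# One vector of alcohol values per quality score.
groups <- split(wine$alcohol, wine$quality)

# Mean, median, and count of alcohol within each quality group.
by_quality <- data.frame(
  quality    = as.integer(names(groups)),
  alc_mean   = vapply(groups, mean, numeric(1)),
  alc_median = vapply(groups, median, numeric(1)),
  n          = vapply(groups, length, integer(1)),
  row.names  = NULL
)

print(by_quality)
```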
Let’s perform more analysis on other variables:
Not much correlation here.
Slight negative correlation between quality and density
Slight negative correlation here.
Based on these correlation coefficients, it looks like alcohol has the highest correlation with quality (Pearson corr coef = .476). The next highest is sulphates with R = .251 and then citric acid with R=.226.
The most negative correlation coefficients with quality are volatile acidity with R = -.391 and total sulfur dioxide with R = -.185, followed closely by density with R = -.175.
The strongest positive relationship out of all the data is fixed acidity to citric acid with R = .672, and the most negative relationship (and the strongest overall by absolute value) is fixed acidity to pH with R = -.683. These variables would logically have strong relationships because they all describe acidity.
The highest R value for a relationship that stuck out to me was density to fixed acidity with R = .668.
As mentioned earlier, the strongest positive relationship I saw was fixed acidity to citric acid with R = .672. However, this makes sense because they both describe acidity.
The highest R value for two variables that didn’t definitely describe the same thing was density to fixed acidity with R = .668.
Let’s look at the two strongest correlated variables with quality in one graph, alcohol and volatile acidity.
This chart isn’t very useful because it’s too busy. Let’s try a scatterplot.
It looks like what I expected. The higher quality wine is more purple (high alcohol content), and falls towards the left on the graph (low volatile acidity).
Now let’s try setting alcohol as the x-axis and volatile acidity as the categorical variable.
This is a good looking chart! It clearly shows as alcohol content increases quality increases, and lower volatile acidity corresponds to higher quality.
Let’s look at alcohol vs. volatile acidity broken into sulphate groups. These are the 3 most correlated variables to quality
This graph doesn’t seem to imply much correlation between these variables at all.
Let’s look at sulphates bucket vs. alcohol colored by quality.
Again this isn’t a very useful chart, because it doesn’t show much correlation.
Let’s look at quality against sulphates, grouped by alcohol bucket
This chart is a little too busy to gather much insight.
Let’s look at alcohol vs. volatile acidity, grouped by quality
This plot shows that as volatile acidity is lower, the quality is higher. It also shows that as alcohol is higher, the quality is higher. Finally, it shows that alcohol has a very weak negative correlation with volatile acidity.
I observed that alcohol content corresponds to increased quality, whereas volatile acidity content corresponds to decreased quality.
Sulphates seemed arbitrary with respect to quality.
The most interesting interaction between features is that alcohol content corresponds to increased quality, whereas volatile acidity content corresponds to decreased quality.
Other than that, it wasn’t too interesting because there weren’t many attributes that had a large effect on quality.
I did not create any models with my dataset.
I chose this plot because it shows how each of the attributes can affect the variable of interest, quality. It helped me gather insight into the data to know which attributes I should explore more closely, and which attributes I can probably disregard because they have a loose connection to quality.
This plot shows how the quality of wine tends to rise with alcohol content. To make this plot, I first plotted a scatter plot of quality vs. alcohol, and then I plotted a boxplot for each value of quality vs. alcohol on top of the scatter plot. Finally, I showed the mean value of alcohol content for each quality group with a red dot. This plot shows a moderate positive correlation between quality and alcohol (R = .476), and alcohol content is more strongly correlated with the quality score than any other attribute.
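A layered plot of this kind can be approximated with base R graphics. This is only a sketch on the first ten rows from the `str()` output, not the code behind the original figure: it draws the boxes first and the raw points on top of them (the reverse of the order described above) so that the points stay visible, and the axis labels are illustrative.

```r
# First ten rows of alcohol and quality, copied from the str() output above.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# One box per quality score. Outlier markers are switched off because the
# raw points are drawn separately; ylim is set from the data so that no
# point falls outside the plotting region.
boxplot(alcohol ~ quality, data = wine, outline = FALSE,
        ylim = range(wine$alcohol),
        xlab = "Quality", ylab = "Alcohol (% by volume)")

# Raw observations, jittered horizontally and semi-transparent.
stripchart(alcohol ~ quality, data = wine, vertical = TRUE,
           method = "jitter", pch = 16,
           col = adjustcolor("black", alpha.f = 0.3), add = TRUE)

# Mean alcohol content per quality group, marked with a red dot.
alc_means <- tapply(wine$alcohol, wine$quality, mean)
points(seq_along(alc_means), alc_means, pch = 19, col = "red")
```

Boxes are placed at x positions 1, 2, 3, ... in order of the quality levels, which is why the group means are plotted against `seq_along(alc_means)`.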
I chose this plot because it shows the two attributes with the strongest correlation to quality. As Alcohol % by volume increases, the quality increases. On the other hand, wines with high Volatile Acidity are more likely to be a lower quality wine than wines with low Volatile Acidity.
This project taught me a lot about how to use R to perform Exploratory Data Analysis. I chose this dataset because I like to drink wine, and I was looking forward to seeing if there were any scientific reasons as to why one bottle of wine was better than another bottle of wine. I was able to gather some insight into what makes experts consider a bottle of wine good. However, the dataset did not allow me to gain as much insight as I would’ve liked for several reasons.
If these 3 experts’ ratings are to be trusted, there were many insights I found into wine quality. First of all, I learned that higher alcohol content often contributes to better quality wine. Also, I learned that high volatile acidity contributes to lower quality wine. This makes sense because this is “the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste.” Out of the other 9 variables, I couldn’t gather too much insight into what constitutes good quality wine, because they had low correlations to the quality variable.
Unfortunately, the dataset I chose didn’t allow me to make too many interesting insights or visualizations for several reasons. First of all, the quality values were only integers. This was very frustrating because if this value took fractions, I could’ve gathered more accurate insights into how other variables contributed to it. Also, the average of the 3 experts’ ratings naturally should’ve been decimals, so it was confusing why the data only had integers.
Another reason the data wasn’t great was there wasn’t much correlation between the variables. For example no variable given in the dataset had a higher correlation coefficient than .5 with quality.
If the correlations were stronger, the graphs would’ve been clearer and more insightful for my audience.
If future work was going to be done with this dataset, I would suggest collecting data from more than 3 experts, and then storing the quality rating as a decimal in order to get more granular insight into how other variables contributed to the rating.
Overall, I got valuable experience working with R and performing Exploratory Data Analysis, but I wish I had chosen a better dataset to work with for this project.
Thanks for reading!